A genomic instability-associated lncRNA signature for predicting prognosis and biomarkers in lung adenocarcinoma

Genomic instability (GI) was associated with tumorigenesis. However, GI-related lncRNA signature (GILncSig) in lung adenocarcinoma (LUAD) is still unknown. In this study, the lncRNA expression data, somatic mutation information and clinical survival information of LUAD were downloaded from The Cancer Genome Atlas (TCGA) and performed differential analysis. Functional and prognosis analysis revealed that multiple GI-related pathways were enriched. By using univariate and multivariate Cox regression analysis, 5 GI-associated lncRNAs (AC012085.2, FAM83A-AS1, MIR223HG, MIR193BHG, LINC01116) were identified and used to construct a GILncSig model. Mutation burden analysis indicated that the high-risk GI group had much higher somatic mutation count and the risk score constructed by the 5 GI-associated lncRNAs was an independent predictor for overall survival (OS) (P < 0.05). Overall, our study provides valuable insights into the involvement of GI-associated lncRNAs in LUAD and highlights their potential as therapeutic targets.

Lung cancer, is the most common malignant tumor and is the leading cause of cancer-related morbidity and mortality worldwide 1 .Non-small-cell lung carcinoma (NSCLC), including lung squamous cell carcinoma and lung adenocarcinoma (LUAD), accounts for approximately 85% of lung cancer 2 .Despite advances in clinical and experimental oncology, the early diagnostic and prognosis biomarkers of NSCLC is still unsatisfactory 3 .Therefore, there is an urgent need to identify new biomarkers to predict the clinical outcome of NSCLC patients.
Smoking, exposure to environmental tobacco smoking, residential radon, cooking oil fumes and particulate matter 2.5 (PM2.5) are main environmental risk factors associated to the occurrence of lung cancer 4,5 .Smoking is the major risk factor for lung cancer, accounting for about 90% of male and 70% of female 6 .Smoking causes multiple alterations to cells and tissues, including DNA single-strand breaks 7 , chromosome exchanges 8 and chromosome instability (CIN) 9 , which can lead to genomic instability (GI).GI, mainly including CIN and microsatellite instability (MSI), has been an important hallmark of various human cancers [10][11][12] .GI has been proven to be closely related to the clinical diagnosis and prognosis of multiple malignant tumors 12,13 .Recent studies have clarified that GI can be used as a prognosis marker in cervical cancer 14 and breast cancer 15 .Emerging evidence revealed that GI was associated with tumorigenesis in lung cancer 16,17 , however, the potential role and mechanism of GI in lung cancer need further be explored.
LncRNAs are defined as nonprotein coding transcripts more than 200 nucleotides in length 18 .Accumulating evidence suggested lncRNAs involve in gene expression at epigenetic, transcription and post-transcriptional levels [19][20][21] .Non-coding RNA activated by DNA damage (NORAD) and long non coding transcriptional activator of miR34a (GUARDIN) could maintain genomic stability by involving in DNA replication and repair in LUAD and colon cancer 22,23 .LncRNA DDSR1, a DNA damage-sensitive RNA1, modulated DNA repair by regulating
Samples from patients with overall survival (OS) of < = 30 days were excluded 25 .In the end, a total of 490 LUAD patients with paired lncRNA and mRNA expression data, somatic mutation information and clinicopathological characteristics were enrolled in our study to build the GILncSig model.
To increase the reliability of our research, we randomly and equally divided the entire dataset into a training set (n = 246), a validation set (n = 244) and the whole dataset was considered as a combination set (n = 490).The workflow of this work was shown in Fig. 1.Clinicopathological features of the 490 LUAD patients were shown in Table 1.

Identification of GI-associated lncRNAs
To obtain GI-associated lncRNAs, as the method described by Bao et al. 26 : (a) the cumulative number of somatic mutations for each patient was calculated; (b) patients were arranged in a decreasing order based on their cumulative number of somatic mutations; (c) those in the top 25% of patients were categorized as genomic unstable (GU)-like team, while those in the last 25% were classified as genomically stable (GS)-like team; (d) expression profiles of lncRNAs between the GU-like team and GS-like team were compared using significance analysis of microarrays (SAM) method; (e) differentially expressed lncRNAs (log|fold change|> 1 and false discovery rate (FDR) adjusted P < 0.05) were defined as genome instability-associated lncRNAs.

Evaluation of risk score
By linearly combining the expression value of GI-associated lncRNAs weighted by their coefficients, a risk-score formula was constructed as following 27 :

Co-expression network
To measure the correlation between lncRNAs and mRNAs, Pearson correlation coefficients was conducted and the top 10 mRNAs were considered as co-expressed lncRNA-associated partners, a lncRNAs-mRNAs coexpression network was constructed.

Functional enrichment analysis
To reveal the potential function of the co-expressed lncRNAs and mRNAs, Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) functional enrichment analysis of lncRNA-correlated PCGs were performed using clusterProfiler soft-ware.

Building and validation of a Nomogram
To know the prognosis value of lncRNA, as the method employed by Iasonos et al. 28 , a nomogram was constructed by including the expression of the 5 lncRNAs in the combination set and a total score of 1-year, 2-year, and 3-year OS were evaluated.Calibration plot was further performed to evaluate the calibration and the discrimination of the nomogram by a bootstrap method with 1000 resamples.

Statistical analysis
Hierarchical cluster analysis was performed using Euclidean distances and Ward's linkage method.Kaplan-Meier analysis was used to calculate the OS.Univariate Cox and Multivariate Cox regression and stratified analysis were used to verify the independence of the GILncSig from other clinical factors and investigate the time-dependent prognostic value of the GILncSig in cancers.Hazard ratio (HR) and 95% confidence interval (CI) were calculated by Cox analysis.Receiver operating characteristic (ROC) was used to investigate the time-dependent prognostic value of the GILncSig.Principal component analysis (PCA) was performed to study the expression patterns in the different groups.All statistical analyses were performed using R-version 4.0.3.A P-value of less than 0.05 was considered statistically significant.

Identification of GI-related lncRNAs in LUAD patients
The cumulative number of somatic mutations in each patient was calculated and ranked to identify GI-associated lncRNAs.The first 25% of patients (n = 134) were assigned to GU-like team and the last 25% to GS-like team (n = 139) (Table S5).To find lncRNAs with significant differences, mRNA and lncRNA expression profiles (Tables S6-7) in each team were compared.With log|fold change|> 1 and FDR-adjusted P-value < 0.05, a total of 185 lncRNAs (candidate genomic instability-related lncRNAs) were considered to be significantly differentially expressed (Table S8).Hierarchical clustering analysis was conducted on the 535 samples in the TCGA set.Through the expression of the 185 differentially expressed lncRNAs, all 535 samples were clustered into two groups (Fig. 2A and Table S9).The group with higher cumulative somatic mutations was defined as GU-like group and the other GS-like group.The count of somatic cumulative mutations, deletion mutation and gene amplifications in the GU-like group was significantly higher than that in the GS-like group (p < 0.001, Fig. 2B-D).
To further explore the potential functions and pathways of the above 185 lncRNAs in LUAD, functional enrichment analysis was analyzed.The 10 most relevant protein-coding genes (PCGs) for each lncRNA among the 185 lncRNAs was found by using the method of co-expression analysis, then the lncRNAs-mRNAs coexpression network was constructed.In the network, the nodes were lncRNAs and mRNAs, they linked together if they were related to each other (Fig. 2E).The regulating effects of lncRNAs and mRNAs in the network were characterized in Table S10, all lncRNAs and mRNAs had cis regulating effects.GO and KEGG analysis of lncRNA-correlated PCGs revealed that mRNAs in this network were significantly associated with GI such as cilium movement, microtubule bundle formation, motile cilium, chromosomal region, microtubule motor activity and cell cycle (Fig. 2F and G).These results suggested that the 185 differentially expressed lncRNAs affected GI through lncRNA-related PCGs regulatory network and they can be candidate biomarkers for GIassociated lncRNAs.

Identification of a GILncSig for prognostic validation in the training set
Univariate Cox regression analysis of the 185 differentially expressed lncRNAs in the training set showed that 7 GI-associated lncRNAs (LINC02587, AC026785.3,AC012085.2,FAM83A-AS1, MIR223HG, MIR193BHG, LINC01116) had the significant prognostic value in LUAD patients (p < 0.05, Fig. 3A).The correlation analysis indicated 7 GI-associated lncRNAs significantly interacted with each other (Fig. 3B). 5 GI-associated lncRNAs (AC012085.2,FAM83A-AS1, MIR223HG, MIR193BHG, LINC01116) were identified by multivariate Cox regression analysis and then were further used to develop the GILncSig model (Fig. 3C).In the GILncSig, AC012085.2,FAM83A-AS1, MIR193BHG, LINC01116 acted as risk factors for LUAD, and MIR223HG acted as a protective factor.The coefficients were shown in Table 2.The risk score of each patient was calculated by the sum of the coefficients of each lncRNA multiplied by the corresponding expression in each patient.Based on the median risk score, the training set was classified into high-risk group (n = 123) and low-risk group (n = 123).As shown in Fig. 3D, the expression level of the risk lncRNAs (AC012085.2,FAM83A-AS1, MIR193BHG, LINC01116) was upregulated, while the protective MIR223HG was downregulated, and the count of somatic mutations was positively correlated with the patient's risk score.The count of somatic mutation of patients in the low-risk group was significantly lower than that of patients in the high-risk group (p < 0.001, Fig. 3E), and our results were similar to those of Matuno et al. 29 .The Kaplan-Meier curve indicated that the patients in the high-risk group had a poorer OS than those in the low-risk group (p < 0.001, Fig. 3F).The time-dependent ROC curve analysis of GILncSig for 1-, 2-, and 3-year OS were 0.785, 0.731 and 0.759 respectively (Fig. 3G).PCA analysis showed low-and high-risk groups were significantly distributed in two different directions, indicating that the LUAD patients in the low-risk group was quite distinguished from those in the high-risk group (Fig. S1A).

Validation of the GILncSig
To further verify the accuracy of the GILncSig model, the 5 coefficients were applied to the validation set (n = 244) and the combination set (n = 490) to confirm the risk score of each patient, then the validation set was classified into the high-risk group (n = 131) and low-risk group (n = 113) and the combination set was classified into the high-risk group (n = 254) and low-risk group (n = 235).In the validation set, with the increase of patient's risk score, the expression level of the risk lncRNAs was also upregulated, the protective lncRNAs was downregulated and the count of somatic mutations also increased (Fig. 4A).The somatic mutation count of patients in the lowrisk group was lower than that of patients in the high-risk group (p < 0.001, Fig. 4B).Patients in the low-risk group had better OS rates than in the high-risk group (p = 0.027, Fig. 4C).The ROC curves showed that the AUCs for the 1-, 2-, and 3-year OS were 0.676, 0.590, and 0.576 (Fig. 4D).PCA showed that the low-and high-risk groups were divided into two different clusters (Fig. S1B).Similarly, the results were also validated in the combination set (p < 0.001, Figs.5A-D and S1C).
Univariate Cox regression analysis was performed using the survival package to investigate the time-dependent prognostic value of the GILncSig in cancers.The results indicate that the GILncSig is a significant risk factor for Adrenocortical carcinoma (ACC), Bladder Urothelial Carcinoma (BLCA), Colon adenocarcinoma (COAD), Kidney renal papillary cell carcinoma (KIRP), Brain Lower Grade Glioma (LGG), LUAD, Pancreatic adenocarcinoma (PAAD), Stomach adenocarcinoma (STAD) and Thymoma (THYM) (Fig. S2 and Table S11).

Somatic mutation types in the combination set
To explore the mutation types of LUAD, the R package "maftools" was used to visualize mutation data from 490 samples in TCGA-LUAD.In the high-risk group, missense mutation was predominant in variant classification (Fig. S3A and S3E); SNP was the most frequent in variant type (Fig. S3B); and the mutation from C to A was the most prevalent in the SNV class (Fig. S3C).Additionally, the median number of variants per sample was 226 (Fig. S3D).The top 10 mutated genes included TTN, MUC16, CSMD3, RYR2, TP53, LRP1B, ZFHX4, USH2A, FLG, and KRAS (Fig. S3F).Similar results were also found in the low-risk group (Fig. S3G-L).The low-risk group had significant higher arm-level amplification and deletion frequencies than the low-risk group (P < 0.05) in the combination set (Fig. S3M).In conclusion, we observed a distinct pattern in the occurrence of mutations during the progression of LUAD.(F, G) Functional enrichment analysis of GO (F) and KEGG (G) for co-expressed lncRNAs and mRNAs.***p < 0.001; LUAD, lung adenocarcinoma; GI, genomic instability; GU, genetic unstable; GS, genetic stable; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Independence of the GILncSig from other clinical factors
To identify the independent prognostic value of GILncSig, univariate and multivariate Cox regression analysis were performed on age, sex, TNM stage, and the risk signature.The tumor TNM stage is determined based on the size of the tumor (Tumor), involvement of lymph nodes (Node), and presence of metastasis (Metastasis) to assess the severity and prognosis of the tumor.The results indicated that the risk signature and TNM stage were independent factors when adjusted for age, sex and smoking in all sets (Table 3).www.nature.com/scientificreports/ In the multivariate Cox regression analysis, TNM stage were also identified as independent prognostic factor.Subsequently, a stratification analysis was performed to evaluate whether the GILncSig could predict patient survival within the same clinical factor subgroup.Patients in the combination group were stratified based on clinical parameters, such as age (< = 65/ > 65), sex (female/male), stage (I + II/III + IV).The results showed that the GILncSig could classify patients of the same stratum of age, sex, and stage into high-and low-risk groups.Patients in the high-risk group had a poorer OS than those with low-risk group in each stratum (Fig. 6A-F.These results indicated that the GILncSig was an independent prognostic factor related to the OS in LUAD.

Construction of a nomogram based on the GILncSig in the combination set
To construct a quantitative method for the prognosis of LUAD patients, we integrated the 5 GI-associated lncRNAs to establish a nomogram (Fig. 6G).The calibration curve for the nomogram indicated that using the nomogram to predict OS was highly consistent with actual OS (Fig. 6H).

Discussion
Since lung cancer has no symptoms in the early stage, majority of patients are already in the advanced stage when they are discovered.Although traditional treatments are constantly improving, the five-year survival rate of lung cancer is only 19% 30 .GI has been reported in various malignant cancers, including lung cancers 11,14,16,17,[31][32][33][34][35] .CIN is the major type of GI in lung cancer, which leads to the high gene mutation burden by chromosome structure and number alterations in cancer cells 11 .In this study, we found that a total of 185 GI-associated lncRNAs were significantly differentially expressed.Functional analysis revealed that GI-associated pathways were significantly enriched, which indicates that GI-associated lncRNAs may be associated with tumorigenesis. 5 GI-associated  www.nature.com/scientificreports/Meanwhile, there are several limitations to our study.(1) Due to the limited lncRNA chip of LUAD, we could only divide the TCGA data set into training set and validation set randomly and more independent data sets are needed to validate the GILncSig to ensure its robustness and reproducibility.(2) In vivo and in vitro experiments were not performed to verify the role of the GILncSig model, therefore, subsequent experiments are needed to validate the reliability of results.(3) As an observational study, confounding factors, such as pharmaceutical treatment, exposure to environmental tobacco smoking, residential radon, cooking oil fumes, PM2.5 and so on, may have an impact on our results.

Conclusions
In summary, in this study, we constructed a novel GILncSig model that may involve in the progression and prognosis in LUAD.Targeting GI-related lncRNAs may be a potential application for LUAD therapy.However, the underlying mechanisms involving in GI need be further explored.

Figure 2 .
Figure 2. The identification of long non-coding RNAs related to GI and subsequent functional enrichment analysis.(A) Clustering analysis of 535 LUAD patients based on the expression of 185 candidate genomic instability-related lncRNAs.The left cluster is GU-like group, and the right cluster is GS-like group.The x-axis represents lung adenocarcinoma patients.The y-axis represents 185 candidate genomic instability-related lncRNAs.(B-D) Boxplots of somatic mutations (B), deletion mutation (C) and gene amplifications (D) in the GU-like group and GS-like group.(E) Co-expression network of GI-related lncRNAs and mRNAs based on the Pearson correlation coefficient.The blue circles represent lncRNAs, and the blue circles represent mRNAs.(F,G) Functional enrichment analysis of GO (F) and KEGG (G) for co-expressed lncRNAs and mRNAs.***p < 0.001; LUAD, lung adenocarcinoma; GI, genomic instability; GU, genetic unstable; GS, genetic stable; GO, Gene Ontology; KEGG, Kyoto Encyclopedia of Genes and Genomes.

Figure 3 .
Figure 3. Identification of the GILncSig in the training set.(A) The forest plot of univariate cox regression identified 7 GI-associated lncRNAs.(B) The correlation analysis among the 7 GI-associated lncRNAs.(C) The forest plot of multivariate Cox regression analysis of 5 GI-associated lncRNAs.(D) The expression and somatic mutation count with increasing risk score of the GILncSig.(E) Boxplot of somatic mutation count in the highand low-risk groups.(F) Kaplan-Meier survival curve of the high-and low-risk groups.(G) Time-dependent ROC curves and area AUC for 1-, 2-, and 3-year OS. *** p < 0.001; GI, genomic instability; ROC receiver operating characteristic; AUC, under the curve.

Figure 4 . 3 Figure 5 .
Figure 4.The prognostic values of the GILncSig in the validation set.(A) The expression and somatic mutation count with increasing risk score of the GILncSig.(B) Boxplot of somatic mutation count in the high-and lowrisk groups.(C) Kaplan-Meier survival curve of the high-and low-risk groups.(D) ROC curves and AUC for 1-, 2-, and 3-year OS. ***p < 0.001; ROC receiver operating characteristic; AUC, under the curve; OS, overall survival.
The flowchart depicting the process of data collection and analysis.

Table 1 .
Clinicopathological features of lung adenocarcinoma patients in the training set, validation set and combination set.

Table 2 .
5 prognostic GI-associated lncRNAs identified from univariate and multivariate Cox regression analysis.

Table 3 .
Univariate and multivariate Cox regression analysis of the GILncSig and overall survival in each set.Significant values are in [bold].